NSF PAR Search | NSF Public Access Repository

The Decaying Missing-at-Random Framework: Model Doubly Robust Causal Inference with Partially Labeled Data

Zhang, Yuqian; Chakrabortty, Abhishek; Bradic, Jelena (April 2025, arXiv.org)

In modern large-scale observational studies, data collection constraints often result in partially labeled datasets, posing challenges for reliable causal inference, especially due to potential labeling bias and the relatively small size of the labeled data. This paper introduces a decaying missing-at-random (decaying MAR) framework and associated approaches for doubly robust causal inference on treatment effects in such semi-supervised (SS) settings. The framework simultaneously addresses selection bias in the labeling mechanism and the extreme imbalance between labeled and unlabeled groups, bridging the gap between the standard SS and missing data literatures, while allowing throughout for confounded treatment assignment and high-dimensional confounders under appropriate sparsity conditions. To ensure robust causal conclusions, we propose a bias-reduced SS (BRSS) estimator for the average treatment effect, a type of 'model doubly robust' estimator appropriate for such settings, and establish its asymptotic normality at the appropriate rate under decaying labeling propensity scores, provided that at least one nuisance model is correctly specified. Our approach also relaxes sparsity conditions beyond those required in existing methods, including standard supervised approaches. Recognizing the asymmetry between labeling and treatment mechanisms, we further introduce a de-coupled BRSS (DC-BRSS) estimator, which integrates inverse probability weighting (IPW) with bias-reducing techniques in nuisance estimation. This refinement further weakens model specification and sparsity requirements. Numerical experiments confirm the effectiveness and adaptability of our estimators in addressing labeling bias and model misspecification.
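As a rough illustration of what "doubly robust" means in this kind of setting, the display below gives the textbook augmented-IPW form for the treated-arm mean. It is not the BRSS or DC-BRSS estimator from the abstract, whose construction is not given here. All notation is assumed for illustration: $$R$$ is the labeling indicator, $$T$$ the treatment indicator, $$X$$ the confounders, $$\hat m_1(X)$$ a working model for $$E[Y(1) \mid X]$$, and $$\hat\gamma_1(X)$$ a working model for $$P(RT = 1 \mid X)$$.

```latex
% Textbook AIPW form (illustrative; notation assumed, not taken from the paper)
\hat\theta_1
  = \frac{1}{N} \sum_{i=1}^{N}
    \left\{ \hat m_1(X_i)
      + \frac{R_i T_i}{\hat\gamma_1(X_i)} \bigl( Y_i - \hat m_1(X_i) \bigr)
    \right\},
\qquad
\widehat{\mathrm{ATE}} = \hat\theta_1 - \hat\theta_0 .
```

Here $$\hat\theta_0$$ is defined analogously with $$1 - T_i$$ in place of $$T_i$$. If $$\hat m_1$$ is correct, the weighted residual term has conditional mean zero; if $$\hat\gamma_1$$ is correct, the weighting recovers $$E[Y(1) \mid X] - \hat m_1(X)$$ and cancels the error in the first term. Either way the average targets $$E[Y(1)]$$, assuming $$RT$$ is independent of $$Y(1)$$ given $$X$$.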
Free, publicly-accessible full text available April 21, 2026
Double robust semi-supervised inference for the mean: selection bias under MAR labeling with decaying overlap

https://doi.org/10.1093/imaiai/iaad021

Zhang, Yuqian; Chakrabortty, Abhishek; Bradic, Jelena (July 2023, Information and Inference: A Journal of the IMA)

Abstract: Semi-supervised (SS) inference has received much attention in recent years. Apart from a moderate-sized labeled dataset, $$\mathcal L$$, the SS setting is characterized by an additional, much larger unlabeled dataset, $$\mathcal U$$. The setting $$|\mathcal U| \gg |\mathcal L|$$ makes SS inference unique and different from standard missing data problems, owing to the natural violation of the so-called ‘positivity’ or ‘overlap’ assumption. However, most of the SS literature implicitly assumes $$\mathcal L$$ and $$\mathcal U$$ to be equally distributed, i.e., no selection bias in the labeling. Inferential challenges under missing-at-random-type labeling, which allows for selection bias, are inevitably exacerbated by the decaying nature of the propensity score (PS). We address this gap for a prototype problem, the estimation of the response’s mean. We propose a double robust SS mean estimator and give a complete characterization of its asymptotic properties. The proposed estimator is consistent as long as either the outcome or the PS model is correctly specified. When both models are correctly specified, we provide inference results with a non-standard consistency rate that depends on the smaller size $$|\mathcal L|$$. The results are also extended to causal inference with imbalanced treatment groups. Further, we provide several novel choices of models and estimators of the decaying PS, including a novel offset logistic model and a stratified labeling model. We present their properties under both high- and low-dimensional settings. These may be of independent interest. Lastly, we present extensive simulations and also a real data application.
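The double robustness claim (consistency when either the outcome model or the PS model is correct) can be illustrated with a small simulation. This is a minimal sketch using the standard augmented-IPW form with known nuisance functions, not the authors' estimator; the function name `dr_mean`, the data-generating model, and the labeling propensity are all assumptions made for the example.

```python
import numpy as np

def dr_mean(y, r, m_hat, pi_hat):
    """Augmented-IPW estimate of E[Y] from partially labeled data.

    y      : outcomes (entries with r == 0 are ignored and may be NaN)
    r      : labeling indicators (1 = labeled)
    m_hat  : outcome-model predictions for every unit
    pi_hat : labeling propensities for every unit, strictly positive
    """
    y, r, m_hat, pi_hat = (np.asarray(a, dtype=float) for a in (y, r, m_hat, pi_hat))
    # Residuals only contribute for labeled units.
    resid = np.where(r == 1, y - m_hat, 0.0)
    return float(np.mean(m_hat + resid / pi_hat))

# Hypothetical data: true mean E[Y] = 1, labeling is rare and depends on x.
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(3.0 - x))        # labeling propensity, selection on x
r = rng.binomial(1, pi)
y_obs = np.where(r == 1, y, np.nan)       # outcomes hidden for unlabeled units

m_true = 1.0 + 2.0 * x
naive = float(np.nanmean(y_obs))                                  # labeled-only mean
est_both = dr_mean(y_obs, r, m_true, pi)                          # both models correct
est_wrong_pi = dr_mean(y_obs, r, m_true, np.full(n, r.mean()))    # PS model wrong
est_wrong_m = dr_mean(y_obs, r, np.zeros(n), pi)                  # outcome model wrong
```

In this setup the labeled-only mean is pulled upward by the selection on `x`, while each of the three doubly robust variants stays near the true mean as long as one nuisance function is correct. In practice both nuisance functions would be estimated rather than known.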
Testability of high-dimensional linear models with nonsparse structures

https://doi.org/10.1214/19-AOS1932

Bradic, Jelena; Fan, Jianqing; Zhu, Yinchu (April 2022, The Annals of Statistics)

Full Text Available
High-dimensional semi-supervised learning: in search of optimal inference of the mean

https://doi.org/10.1093/biomet/asab042

Zhang, Yuqian; Bradic, Jelena (September 2021, Biometrika)

Summary: A fundamental challenge in semi-supervised learning lies in the observed data’s disproportional size when compared with the size of the data collected with missing outcomes. An implicit understanding is that the dataset with missing outcomes, being significantly larger, ought to improve estimation and inference. However, it is unclear to what extent this is correct. We illustrate one clear benefit: root-$$n$$ inference of the outcome’s mean is possible while only requiring a consistent estimation of the outcome, possibly at a rate slower than root-$$n$$. This is achieved by a novel $$k$$-fold, cross-fitted, double robust estimator. We discuss both linear and nonlinear outcomes. Such an estimator is particularly suited for models that naturally do not admit root-$$n$$ consistency, such as high-dimensional, nonparametric or semiparametric models. We apply our methods to estimating heterogeneous treatment effects.
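The idea of a $$k$$-fold cross-fitted estimator of the mean can be sketched as follows. This is an illustrative simplification, not the estimator from the paper: it assumes labels are missing completely at random, uses ordinary least squares as the outcome model, and averages predictions over the unlabeled covariates only. The function name `cross_fitted_ss_mean` and the simulated data are hypothetical.

```python
import numpy as np

def cross_fitted_ss_mean(x_lab, y_lab, x_unlab, k=5, seed=0):
    """Cross-fitted semi-supervised estimate of E[Y] (illustrative sketch).

    For each fold, an OLS outcome model is fit on the other labeled folds,
    averaged over the unlabeled covariates, and bias-corrected with the
    residuals of the held-out labeled fold.
    """
    n = len(y_lab)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    with_intercept = lambda x: np.column_stack([np.ones(len(x)), x])
    xl, xu = with_intercept(x_lab), with_intercept(x_unlab)
    estimates = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        beta, *_ = np.linalg.lstsq(xl[train], y_lab[train], rcond=None)
        plug_in = np.mean(xu @ beta)                              # uses unlabeled data
        correction = np.mean(y_lab[held_out] - xl[held_out] @ beta)  # out-of-fold residuals
        estimates.append(plug_in + correction)
    return float(np.mean(estimates))

# Hypothetical data: 500 labeled units, 50,000 unlabeled units, true mean 2.
rng = np.random.default_rng(42)
coef = np.array([1.0, -1.0, 0.5])
x_lab = rng.normal(size=(500, 3))
y_lab = 2.0 + x_lab @ coef + rng.normal(size=500)
x_unlab = rng.normal(size=(50_000, 3))

ss_estimate = cross_fitted_ss_mean(x_lab, y_lab, x_unlab)
```

Because the residual correction for each fold uses a model that never saw that fold, the correction does not inherit overfitting bias from the fitted model, which is the usual motivation for cross-fitting.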
Full Text Available
Comments on Leo Breiman's paper: "Statistical Modeling: The Two Cultures" (Statistical Science, 2001, 16(3), 199-231)

https://doi.org/10.1353/obs.2021.0019

Bradic, Jelena; Zhu, Yinchu (January 2021, Observational Studies)

Full Text Available
Confidence intervals for high-dimensional Cox models

https://doi.org/10.5705/ss.202018.0247

Yu, Yi; Bradic, Jelena; Samworth, Richard J. (January 2021, Statistica Sinica)

Full Text Available
Fixed Effects Testing in High-Dimensional Linear Mixed Models

https://doi.org/10.1080/01621459.2019.1660172

Bradic, Jelena; Claeskens, Gerda; Gueuning, Thomas (October 2020, Journal of the American Statistical Association)

Full Text Available
A Tuning-free Robust and Efficient Approach to High-dimensional Regression

https://doi.org/10.1080/01621459.2020.1840989

Wang, Lan; Peng, Bo; Bradic, Jelena; Li, Runze; Wu, Yunan (October 2020, Journal of the American Statistical Association)
Full Text Available
Rejoinder to “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression”

https://doi.org/10.1080/01621459.2020.1843865

Wang, Lan; Peng, Bo; Bradic, Jelena; Li, Runze; Wu, Yunan (October 2020, Journal of the American Statistical Association)
Full Text Available
Censored Quantile Regression Forest

Li, Alexander Hanbo; Bradic, Jelena (January 2020, Proceedings of Machine Learning Research)
Chiappa, Silvia; Calandra, Roberto (Ed.)
Random forests are a powerful non-parametric regression method but are severely limited in their usage in the presence of randomly censored observations and, when naively applied, can exhibit poor predictive performance due to the incurred biases. Based on a local adaptive representation of random forests, we develop its regression adjustment for randomly censored regression quantile models. Regression adjustment is based on a new estimating equation that adapts to censoring and leads to the quantile score whenever the data do not exhibit censoring. The proposed procedure, named censored quantile regression forest, allows us to estimate quantiles of time-to-event without any parametric modeling assumption. We establish its consistency under mild model specifications. Numerical studies showcase a clear advantage of the proposed procedure.
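For context on the uncensored case mentioned above, the quantile score corresponds to the standard check (pinball) loss, whose minimizer is the sample quantile. The sketch below shows only that uncensored baseline; it does not implement the censoring-adjusted estimating equation or the forest weights, and the names `pinball_loss`, `grid`, and the simulated exponential data are assumptions for illustration.

```python
import numpy as np

def pinball_loss(q, y, tau):
    """Average check loss of candidate quantile q at level tau."""
    u = np.asarray(y, dtype=float) - q
    # Positive residuals are weighted by tau, negative ones by (1 - tau).
    return float(np.mean(u * (tau - (u < 0))))

# Hypothetical uncensored event times.
rng = np.random.default_rng(1)
y = rng.exponential(scale=1.0, size=5000)
tau = 0.5

# Minimize the loss over a fine grid of candidate quantiles.
grid = np.linspace(0.0, 3.0, 3001)
losses = [pinball_loss(q, y, tau) for q in grid]
q_hat = float(grid[int(np.argmin(losses))])

# The grid minimizer should agree with the empirical quantile.
q_emp = float(np.quantile(y, tau))
```

With censored observations this loss can no longer be evaluated directly on the event times, which is the gap the abstract's estimating equation is designed to address.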
Full Text Available